Santander Customer Satisfaction


From frontline support teams to C-suites, customer satisfaction is a key measure of success. Unhappy customers don't stick around, and they rarely voice their dissatisfaction before leaving. Santander Bank is asking for help to identify dissatisfied customers early in their relationship, so that it can take proactive steps to improve a customer's happiness before it's too late. The project provides hundreds of anonymized features to predict whether a customer is satisfied or dissatisfied with their banking experience. All the project info can be found in this link. The metric used to evaluate the score is AUC (Area Under the ROC Curve).
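As a quick illustration of the metric (a toy sketch, not project data), AUC measures how well predicted scores rank positives above negatives; scikit-learn's `roc_auc_score` computes it directly:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
# a perfect ranking of positives above negatives scores 1.0
print(roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9]))  # 1.0
# 3 of the 4 positive-negative pairs ranked correctly -> 0.75
print(roc_auc_score(y_true, [0.8, 0.1, 0.9, 0.2]))  # 0.75
```

An AUC of 0.5 corresponds to a random ranking, which is why AUC is a suitable metric for a dataset as imbalanced as this one.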
In [1]:
import warnings
warnings.simplefilter(action = 'ignore', category = FutureWarning)

import pandas as pd
import numpy as np


# Plot libraries
import matplotlib
from matplotlib import transforms, pyplot as plt
import seaborn as sns
import bokeh
import plotly.express as px
import plotly.graph_objects as go

# Data preparation
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE, ADASYN

# Cross validation
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold, PredefinedSplit
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Models
from sklearn.linear_model import LogisticRegression
import xgboost as xgb

# Models score
from sklearn.metrics import roc_auc_score, roc_curve

%matplotlib inline
%config Completer.use_jedi = False

Configure matplotlib

In [2]:
# configure plot font family to Arial
plt.rcParams['font.family'] = 'Arial'
# configure mathtext bold and italic font family to Arial
matplotlib.rcParams['mathtext.fontset'] = 'custom'
matplotlib.rcParams['mathtext.bf'] = 'Arial:bold'
matplotlib.rcParams['mathtext.it'] = 'Arial:italic'
In [3]:
# define colors
GRAY1, GRAY2, GRAY3 = '#231F20', '#414040', '#555655'
GRAY4, GRAY5, GRAY6 = '#646369', '#76787B', '#828282'
GRAY7, GRAY8, GRAY9 = '#929497', '#A6A6A5', '#BFBEBE'
BLUE1, BLUE2, BLUE3, BLUE4 = '#174A7E', '#4A81BF', '#94B2D7', '#94AFC5'
RED1, RED2 = '#C3514E', '#E6BAB7'
GREEN1, GREEN2 = '#0C8040', '#9ABB59'
ORANGE1 = '#F79747'

Import Data

In [4]:
df =  pd.read_csv('data/train.csv')
In [5]:
df.head()
Out[5]:
ID var3 var15 imp_ent_var16_ult1 imp_op_var39_comer_ult1 imp_op_var39_comer_ult3 imp_op_var40_comer_ult1 imp_op_var40_comer_ult3 imp_op_var40_efect_ult1 imp_op_var40_efect_ult3 ... saldo_medio_var33_hace2 saldo_medio_var33_hace3 saldo_medio_var33_ult1 saldo_medio_var33_ult3 saldo_medio_var44_hace2 saldo_medio_var44_hace3 saldo_medio_var44_ult1 saldo_medio_var44_ult3 var38 TARGET
0 1 2 23 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 39205.170000 0
1 3 2 34 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 49278.030000 0
2 4 2 23 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 67333.770000 0
3 8 2 37 0.0 195.0 195.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 64007.970000 0
4 10 2 39 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 117310.979016 0

5 rows × 371 columns

We will first drop the ID variable, as it carries no useful information for training the model.

In [6]:
df = df.drop('ID', axis=1)

Data Exploration

In [7]:
df.shape
Out[7]:
(76020, 370)

There are 76020 rows (observations) and 370 variables (after dropping ID).

In [8]:
df.dtypes.value_counts()
Out[8]:
int64      259
float64    111
dtype: int64

All variables are numeric: 259 integers and 111 floats.

It is important to check whether any variable has NA values.

In [9]:
df.isna().sum().sum()
Out[9]:
0

There are no NA values in the dataset.

Another important step is to check whether the dataset contains duplicated rows.

In [10]:
df.duplicated().sum()
Out[10]:
4807

There are 4807 duplicated rows, which we will drop.

In [11]:
df.drop_duplicates(inplace=True)
df.shape
Out[11]:
(71213, 370)

Let's take a look to see whether some of these variables might be categorical. First we will count the number of unique values per column.

In [12]:
counts = df.nunique()
In [13]:
counts_df = pd.DataFrame({'variable':counts.index, 'count':counts.values})
In [14]:
counts_df.groupby('count').count()
Out[14]:
variable
count
1 34
2 106
3 31
4 20
5 14
... ...
14778 1
15730 1
16940 1
17330 1
57736 1

115 rows × 1 columns

It can be seen that 34 variables are constant, adding no relevant information to the data.

In [15]:
# Keep only variables with more than one unique value
df2 = df[df.columns[df.nunique() != 1]]
In [16]:
df2.shape
Out[16]:
(71213, 336)

The number of variables is now 336.

Another way to drop variables is to look for perfect negative or positive correlations between them. To do that, we compute the absolute correlation matrix, select its upper triangle, and drop the columns for which abs(corr) == 1.

In [17]:
# Compute the absolute correlation matrix
mtcorr = df2.corr().abs()
mtcorr
Out[17]:
var3 var15 imp_ent_var16_ult1 imp_op_var39_comer_ult1 imp_op_var39_comer_ult3 imp_op_var40_comer_ult1 imp_op_var40_comer_ult3 imp_op_var40_efect_ult1 imp_op_var40_efect_ult3 imp_op_var40_ult1 ... saldo_medio_var33_hace2 saldo_medio_var33_hace3 saldo_medio_var33_ult1 saldo_medio_var33_ult3 saldo_medio_var44_hace2 saldo_medio_var44_hace3 saldo_medio_var44_ult1 saldo_medio_var44_ult3 var38 TARGET
var3 1.000000 0.004746 0.001912 0.006112 0.006972 0.001556 0.001734 0.000543 0.000627 0.001345 ... 0.000735 0.000504 0.000655 0.000686 0.000633 0.000521 0.000757 0.000798 0.000054 0.004146
var15 0.004746 1.000000 0.043180 0.091026 0.097441 0.042609 0.048382 0.008620 0.009455 0.035731 ... 0.029501 0.017300 0.028679 0.029345 0.029368 0.018937 0.033010 0.033763 0.007049 0.097508
imp_ent_var16_ult1 0.001912 0.043180 1.000000 0.040519 0.034149 0.009760 0.009227 0.000544 0.002454 0.011384 ... 0.000927 0.000676 0.000605 0.000599 0.002599 0.000658 0.004988 0.006520 0.000037 0.000007
imp_op_var39_comer_ult1 0.006112 0.091026 0.040519 1.000000 0.886118 0.342703 0.295160 0.032135 0.054669 0.249162 ... 0.016196 0.011564 0.012365 0.013489 0.009234 0.005360 0.011412 0.010533 0.012718 0.010763
imp_op_var39_comer_ult3 0.006972 0.097441 0.034149 0.886118 1.000000 0.316635 0.355645 0.028941 0.055310 0.247626 ... 0.027276 0.021677 0.018226 0.020323 0.008322 0.006184 0.010399 0.009544 0.013450 0.003687
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
saldo_medio_var44_hace3 0.000521 0.018937 0.000658 0.005360 0.006184 0.000513 0.000565 0.000181 0.000209 0.000445 ... 0.000234 0.000161 0.000209 0.000219 0.332164 1.000000 0.229146 0.213178 0.003663 0.002636
saldo_medio_var44_ult1 0.000757 0.033010 0.004988 0.011412 0.010399 0.000303 0.000564 0.000271 0.000312 0.000302 ... 0.000797 0.000241 0.002474 0.002185 0.818296 0.229146 1.000000 0.968166 0.003278 0.003204
saldo_medio_var44_ult3 0.000798 0.033763 0.006520 0.010533 0.009544 0.000423 0.000658 0.000286 0.000329 0.000400 ... 0.000954 0.000254 0.002884 0.002552 0.710587 0.213178 0.968166 1.000000 0.003056 0.003112
var38 0.000054 0.007049 0.000037 0.012718 0.013450 0.016695 0.015650 0.000308 0.000693 0.003608 ... 0.004470 0.001616 0.004275 0.004311 0.002906 0.003663 0.003278 0.003056 1.000000 0.020117
TARGET 0.004146 0.097508 0.000007 0.010763 0.003687 0.003233 0.000361 0.019872 0.020641 0.003198 ... 0.003649 0.002511 0.003269 0.003420 0.003280 0.002636 0.003204 0.003112 0.020117 1.000000

336 rows × 336 columns

In [18]:
# Select the upper triangle of the correlation matrix
upper = mtcorr.where(np.triu(np.ones(mtcorr.shape), k=1).astype(np.bool_))
upper
Out[18]:
var3 var15 imp_ent_var16_ult1 imp_op_var39_comer_ult1 imp_op_var39_comer_ult3 imp_op_var40_comer_ult1 imp_op_var40_comer_ult3 imp_op_var40_efect_ult1 imp_op_var40_efect_ult3 imp_op_var40_ult1 ... saldo_medio_var33_hace2 saldo_medio_var33_hace3 saldo_medio_var33_ult1 saldo_medio_var33_ult3 saldo_medio_var44_hace2 saldo_medio_var44_hace3 saldo_medio_var44_ult1 saldo_medio_var44_ult3 var38 TARGET
var3 NaN 0.004746 0.001912 0.006112 0.006972 0.001556 0.001734 0.000543 0.000627 0.001345 ... 0.000735 0.000504 0.000655 0.000686 0.000633 0.000521 0.000757 0.000798 0.000054 0.004146
var15 NaN NaN 0.043180 0.091026 0.097441 0.042609 0.048382 0.008620 0.009455 0.035731 ... 0.029501 0.017300 0.028679 0.029345 0.029368 0.018937 0.033010 0.033763 0.007049 0.097508
imp_ent_var16_ult1 NaN NaN NaN 0.040519 0.034149 0.009760 0.009227 0.000544 0.002454 0.011384 ... 0.000927 0.000676 0.000605 0.000599 0.002599 0.000658 0.004988 0.006520 0.000037 0.000007
imp_op_var39_comer_ult1 NaN NaN NaN NaN 0.886118 0.342703 0.295160 0.032135 0.054669 0.249162 ... 0.016196 0.011564 0.012365 0.013489 0.009234 0.005360 0.011412 0.010533 0.012718 0.010763
imp_op_var39_comer_ult3 NaN NaN NaN NaN NaN 0.316635 0.355645 0.028941 0.055310 0.247626 ... 0.027276 0.021677 0.018226 0.020323 0.008322 0.006184 0.010399 0.009544 0.013450 0.003687
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
saldo_medio_var44_hace3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 0.229146 0.213178 0.003663 0.002636
saldo_medio_var44_ult1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 0.968166 0.003278 0.003204
saldo_medio_var44_ult3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 0.003056 0.003112
var38 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.020117
TARGET NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

336 rows × 336 columns

In [19]:
# Choose columns with an absolute correlation of 1 to another column
col_drop = [ col for col in mtcorr.columns if any(upper[col] == 1) ]
col_drop
Out[19]:
['ind_var13_medio',
 'ind_var18',
 'ind_var26',
 'ind_var25',
 'ind_var29_0',
 'ind_var29',
 'ind_var32',
 'ind_var34',
 'ind_var37',
 'ind_var39',
 'num_var13_medio',
 'num_var18',
 'num_var26',
 'num_var25',
 'num_var29_0',
 'num_var29',
 'num_var32',
 'num_var34',
 'num_var37',
 'num_var39',
 'saldo_var29',
 'delta_num_reemb_var13_1y3',
 'delta_num_reemb_var17_1y3',
 'delta_num_reemb_var33_1y3',
 'delta_num_trasp_var17_in_1y3',
 'delta_num_trasp_var17_out_1y3',
 'delta_num_trasp_var33_in_1y3',
 'delta_num_trasp_var33_out_1y3',
 'num_meses_var13_medio_ult3',
 'saldo_medio_var13_medio_ult1']
In [20]:
# Dropping perfectly correlated variables
df3 = df2.drop(columns=col_drop)
In [21]:
df3.shape
Out[21]:
(71213, 306)
In [212]:
final_col = df3.columns

We were able to reduce the number of columns from 371 to 306, and the number of rows from 76020 to 71213.

Target exploration

In [22]:
label_count = pd.DataFrame(df3.TARGET.replace({0:'Satisfied', 1: 'Not Satisfied'}).value_counts())
label_count.rename(columns = {'TARGET': 'COUNT'}, inplace=True)
label_count['PERC'] = round(label_count.COUNT*100/sum(label_count.COUNT),2)
label_count
Out[22]:
COUNT PERC
Satisfied 68398 96.05
Not Satisfied 2815 3.95
In [23]:
fig = px.bar(x = label_count.PERC, 
           y = label_count.index,
            template = 'simple_white',
            color = label_count.index,
            color_discrete_map = {'Satisfied' : GRAY9,
                                 'Not Satisfied': BLUE1},
             opacity = 0.7,
            orientation='h',
            labels = {'x':'Percentage',
                     'y':'TARGET'},
            title = 'TARGET')

fig.update_layout(showlegend=False)

fig.update_layout(
    font_family="Arial",
    font_size = 20,
    font_color= GRAY6,
    title_font_family="Arial Bold",
    title_font_size = 25,
    title_font_color= GRAY3,
    legend_title_font_color=GRAY6,
    xaxis = {"ticksuffix": " %",
             'side': "top"} 
)

fig.update_layout(yaxis={'visible': True, 'showticklabels': True, 'title':'', 'linecolor': GRAY6},
                 xaxis={'visible': True, 'showticklabels': True, 'title':'', 'linecolor':GRAY6})

fig.show()

We can see that the target variable is categorical and that the two classes are unbalanced: 0 denotes satisfied clients and 1 denotes dissatisfied ones. The bar plot improves the visualization and shows that only 3.95% of clients are 'Not Satisfied'.

In [24]:
df3.TARGET = df3.TARGET.astype('category')

Split data into train, validation and test

In [25]:
X = df3.drop('TARGET', axis = 1)
y = df3.TARGET
In [26]:
# Set aside 25% for test/validation data for evaluation
X_train, X_vali_test, y_train, y_vali_test =  train_test_split(X, y, 
                                                     test_size=0.25, 
                                                     shuffle= True)

# Set aside 15% for validation and 10% for test
X_vali, X_test, y_vali, y_test =  train_test_split(X_vali_test, y_vali_test, 
                                                     test_size=0.40, 
                                                     shuffle= True)



print('------ Data ------ ')
print('X_train shape: {}'.format(X_train.shape))
print('X_vali shape: {}'.format(X_vali.shape))
print('X_test shape: {}'.format(X_test.shape))
print('\n------ Label ------ ')
print('y_train shape: {}'.format(y_train.shape))
print('y_vali shape: {}'.format(y_vali.shape))
print('y_test shape: {}'.format(y_test.shape))
------ Data ------ 
X_train shape: (53409, 305)
X_vali shape: (10682, 305)
X_test shape: (7122, 305)

------ Label ------ 
y_train shape: (53409,)
y_vali shape: (10682,)
y_test shape: (7122,)

Balance train data

In [27]:
y_train.value_counts()
Out[27]:
0    51309
1     2100
Name: TARGET, dtype: int64
In [28]:
X_res, y_res = SMOTE( n_jobs = -1).fit_resample(X_train, y_train)
In [29]:
y_res.value_counts()
Out[29]:
0    51309
1    51309
Name: TARGET, dtype: int64

Dimensionality Reduction

There are several techniques to decompose the attributes into a smaller subset. These are useful for data exploration, visualization, building predictive models, or clustering. Because our current dataframe has many variables, we will build a pipeline to reduce the dimensionality of the data.

The following PCA method will be used to reduce the dimensionality of the data.

PCA

PCA (Principal Component Analysis) is one of the main methods for reducing the dimensionality of data. It linearly combines the original columns into new components chosen to maximize the captured variance. Each component is orthogonal to the others, and the components are ordered by the amount of variance they explain.
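Both properties (orthogonal components, ordered by explained variance) can be checked directly on a small random matrix; this is a toy sketch, not project data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
pca = PCA().fit(X_demo)

# rows of components_ are orthonormal: their Gram matrix is the identity
gram = pca.components_ @ pca.components_.T
print(np.allclose(gram, np.eye(5)))                          # True

# explained variance ratios are sorted from strongest to weakest
print(np.all(np.diff(pca.explained_variance_ratio_) <= 0))   # True
```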

In [30]:
pca_pipe = Pipeline([('std', StandardScaler()),
                    ('pca', PCA())])
In [31]:
train_pca = pca_pipe.fit_transform(X_res)
In [32]:
pca_number = pca_pipe.named_steps['pca'].explained_variance_ratio_
In [33]:
fig = go.Figure()

fig.add_trace(go.Scatter(
    x = np.arange(pca_number.shape[0]), 
    y = np.cumsum(pca_number),
    mode = 'lines',
    line = dict( color = BLUE2,
               width = 4)))

# Edit the layout
fig.update_layout(title='PCA',
                   xaxis_title="Number of PCA's",
                   yaxis_title='Explained Variance',
                   template = 'simple_white',
                  font_family="Arial",
                  font_size = 20, 
                  font_color= GRAY6, 
                  title_font_family="Arial Bold", 
                  title_font_size = 25, 
                  title_font_color= GRAY3, 
                  legend_title_font_color=GRAY6)

fig.show()

From the interactive plot it is possible to see that fewer than 30% of the components explain 95% of the total variance. Let's fit the PCA again, choosing the number of components with a 95% explained-variance cut-off.

In [34]:
pca_pipe = Pipeline([('std', StandardScaler()),
                    ('pca', PCA(n_components=0.95))])
In [35]:
train_pca = pca_pipe.fit_transform(X_res)
In [36]:
train_pca.shape
Out[36]:
(102618, 101)
In [37]:
vali_pca = pca_pipe.transform(X_vali)
test_pca = pca_pipe.transform(X_test)

Logistic Regression

In [71]:
lr = LogisticRegression( solver = 'liblinear',
                        max_iter= 2500)
In [72]:
%%time
# train model
lr.fit(train_pca, y_res)
Wall time: 2min 55s
Out[72]:
LogisticRegression(max_iter=2500, solver='liblinear')
In [73]:
y_res.shape
Out[73]:
(102618,)
In [74]:
print('Logistic Regression')
print(f'Train AUC score: {round(roc_auc_score(y_res, lr.predict_proba(train_pca)[:, 1]),2)}')
print(f'Vali AUC score: {round(roc_auc_score(y_vali, lr.predict_proba(vali_pca)[:, 1]),2)}')
print(f'Test AUC score: {round(roc_auc_score(y_test, lr.predict_proba(test_pca)[:, 1]),2)}')
Logistic Regression
Train AUC score: 0.88
Vali AUC score: 0.74
Test AUC score: 0.74
In [84]:
# Compute fpr, tpr and roc_auc for the train, test and validation sets

def calculate_roc_fpr_tpr(train_data, test_data, valid_data, model):
    fpr = dict()
    tpr = dict()
    roc_auc = dict()

    # Calculate for train
    y_pred_proba_train = model.predict_proba(train_data)[:,1]
    fpr['train'], tpr['train'], _ = roc_curve(y_res,  y_pred_proba_train)
    roc_auc['train'] = roc_auc_score(y_res, y_pred_proba_train)

    # Calculate for test
    y_pred_proba_test = model.predict_proba(test_data)[:,1]
    fpr['test'], tpr['test'], _ = roc_curve(y_test,  y_pred_proba_test)
    roc_auc['test'] = roc_auc_score(y_test, y_pred_proba_test)

    # Calculate for validation
    y_pred_proba_vali = model.predict_proba(valid_data)[:,1]
    fpr['vali'], tpr['vali'], _ = roc_curve(y_vali,  y_pred_proba_vali)
    roc_auc['vali'] = roc_auc_score(y_vali, y_pred_proba_vali)
    
    return fpr, tpr, roc_auc

fpr, tpr, roc_auc = calculate_roc_fpr_tpr(train_pca, test_pca, vali_pca, model = lr)
In [81]:
def plot_auc_curve(fpr, tpr, roc_auc, title_model = ''):

    fig = go.Figure()
    fig.add_shape(
        type='line', line=dict(dash='dash', color = GRAY9),
        x0=0, x1=1, y0=0, y1=1
    )

    fig.add_trace(go.Scatter(x=fpr['train'], y=tpr['train'], 
                             name='train - AUC ='+str(round(roc_auc['train'], 2)),
                             mode='lines',
                             line = dict( color = BLUE2, width = 4)))

    fig.add_trace(go.Scatter(x=fpr['test'], y=tpr['test'], 
                             name='test - AUC = '+str(round(roc_auc['test'], 2)), 
                             mode='lines',
                             line = dict( color = ORANGE1, width = 4)))

    fig.add_trace(go.Scatter(x=fpr['vali'], y=tpr['vali'], 
                             name='vali - AUC = '+str(round(roc_auc['vali'], 2)), 
                             mode='lines',
                             line = dict( color = RED1, width = 4)))


    fig.update_layout(
        title='AUC ' + title_model,
        xaxis_title='False Positive Rate',
        yaxis_title='True Positive Rate',
        template = "plotly_white",
        width=800, height=600,
        font_family="Arial",
        font_size = 20,
        font_color= GRAY6,
        title_font_family="Arial Bold",
        title_font_size = 25,
        title_font_color= GRAY3,
        legend_title_font_color=GRAY6
    )

    fig.update_layout(legend=dict(
        yanchor="top",
        y=0.3,
        xanchor="left",
        x=0.6
    ))

    fig.show()
    
    
plot_auc_curve(fpr, tpr, roc_auc, title_model = 'Logistic Regression')

It can be seen that performance on the training data is considerably better than on the validation and test sets. To improve the model, we will tune its hyperparameters.

Logistic Regression - Model Tuning

In [45]:
%%time

# define parameters
solvers = ['sag']
penalty = ['l2']
c_values = np.logspace(0, 3, 25)

# define grid search
grid = dict(solver=solvers,penalty=penalty,C=c_values)
cv = StratifiedKFold(n_splits=5,                      
                    shuffle = True)

lr_grid = GridSearchCV(estimator=LogisticRegression(tol= 1e-3, max_iter= 5000), 
                           param_grid=grid,
                           n_jobs=-1, 
                           cv=cv, 
                           scoring= 'roc_auc',
                      verbose = 2)

lr_result = lr_grid.fit(train_pca, y_res)
Fitting 5 folds for each of 25 candidates, totalling 125 fits
Wall time: 12min 35s
In [46]:
lr_grid.best_score_
Out[46]:
0.8699348853798016
In [47]:
lr_grid.best_params_
Out[47]:
{'C': 749.8942093324558, 'penalty': 'l2', 'solver': 'sag'}
In [67]:
print('Logistic Regression')

print(f'Train AUC score: {round(roc_auc_score(y_res, lr_grid.predict_proba(train_pca)[:, 1]),3)}')
print(f'Vali AUC score: {round(roc_auc_score(y_vali, lr_grid.predict_proba(vali_pca)[:, 1]),3)}')
print(f'Test AUC score: {round(roc_auc_score(y_test, lr_grid.predict_proba(test_pca)[:, 1]),3)}')
Logistic Regression
Train AUC score: 0.871
Vali AUC score: 0.751
Test AUC score: 0.787

We will now try using a predefined validation set in the grid search.
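`PredefinedSplit` marks rows labelled `-1` in `test_fold` as training-only, while each non-negative label defines a held-out fold. A minimal sketch:

```python
from sklearn.model_selection import PredefinedSplit

# first three rows: training only (-1); last two rows: validation fold 0
ps = PredefinedSplit(test_fold=[-1, -1, -1, 0, 0])
for train_idx, test_idx in ps.split():
    print(train_idx, test_idx)  # [0 1 2] [3 4]
```

This lets `GridSearchCV` score every candidate on our fixed validation set instead of cross-validating on the (SMOTE-resampled) training data.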

In [53]:
split_index_pca = [-1]*len(train_pca) + [0]*len(X_vali)

X_new = np.concatenate((train_pca, vali_pca), axis=0)
y_new = np.concatenate((y_res, y_vali), axis=0)

print(X_new.shape)
print(y_new.shape)
(113172, 98)
(113172,)
In [54]:
%%time
cv_split_pca = PredefinedSplit(test_fold= split_index_pca)

lr_grid_split = GridSearchCV(estimator=LogisticRegression(tol= 1e-3, max_iter= 5000), 
                           param_grid=grid,
                           n_jobs=-1, 
                           cv=cv_split_pca, 
                           scoring= 'roc_auc',
                      verbose = 2)

lr_grid_split.fit(X_new, y_new)
Fitting 1 folds for each of 25 candidates, totalling 25 fits
Wall time: 3min 35s
Out[54]:
GridSearchCV(cv=PredefinedSplit(test_fold=array([-1, -1, ...,  0,  0])),
             estimator=LogisticRegression(max_iter=5000, tol=0.001), n_jobs=-1,
             param_grid={'C': array([   1.        ,    1.33352143,    1.77827941,    2.37137371,
          3.16227766,    4.21696503,    5.62341325,    7.49894209,
         10.        ,   13.33521432,   17.7827941 ,   23.71373706,
         31.6227766 ,   42.16965034,   56.23413252,   74.98942093,
        100.        ,  133.35214322,  177.827941  ,  237.13737057,
        316.22776602,  421.69650343,  562.34132519,  749.89420933,
       1000.        ]),
                         'penalty': ['l2'], 'solver': ['sag']},
             scoring='roc_auc', verbose=2)
In [55]:
print(f'Best score: {lr_grid_split.best_score_}.')
print(f'Best parameters: {lr_grid.best_params_}')
Best score: 0.7514533319381747.
Best parameters: {'C': 749.8942093324558, 'penalty': 'l2', 'solver': 'sag'}

Best parameters:

  • C: 750

Best model training:

In [89]:
lr_best = LogisticRegression( C = 750, 
                             penalty = 'l2', 
                             solver = 'sag' ,
                             tol= 1e-3, 
                             max_iter= 5000
                            )

lr_best.fit(train_pca, y_res)
Out[89]:
LogisticRegression(C=750, max_iter=5000, solver='sag', tol=0.001)
In [90]:
print('Logistic Regression - Best Model')

print(f'Train AUC score: {round(roc_auc_score(y_res, lr_best.predict_proba(train_pca)[:, 1]), 3)}')
print(f'Vali AUC score: {round(roc_auc_score(y_vali, lr_best.predict_proba(vali_pca)[:, 1]),3)}')
print(f'Test AUC score: {round(roc_auc_score(y_test, lr_best.predict_proba(test_pca)[:, 1]),3)}')
Logistic Regression - Best Model
Train AUC score: 0.875
Vali AUC score: 0.753
Test AUC score: 0.752
In [91]:
fpr, tpr, roc_auc = calculate_roc_fpr_tpr(train_pca, test_pca, vali_pca, model = lr_best)
plot_auc_curve(fpr, tpr, roc_auc, title_model= 'Logistic Regression Tuned')

Naive Bayes

In [68]:
from sklearn.naive_bayes import GaussianNB
In [69]:
nb = GaussianNB()
In [70]:
nb.fit(train_pca, y_res)
Out[70]:
GaussianNB()
In [71]:
print(f'Train AUC score: {round(roc_auc_score(y_res, nb.predict_proba(train_pca)[:, 1]),2)}')
print(f'Test AUC score: {round(roc_auc_score(y_test, nb.predict_proba(test_pca)[:, 1]),2)}')
Train AUC score: 0.56
Test AUC score: 0.54

XGBoost

Because XGBoost does not require features to be scaled and centered, we will try it both without and with pre-processing.

Without Scale and PCA

In [38]:
# Supported tree methods are `gpu_hist`, `approx`, and `hist`.
clf = xgb.XGBClassifier(
    tree_method="gpu_hist", 
    objective = 'binary:logistic', #binary:logitraw
    eval_metric = 'auc',
    sampling_method = 'gradient_based',
    n_jobs = -1,
    # Default Parameters
    learning_rate = 0.3,
    max_depth = 6,
    colsample_bytree =1,
    subsample = 1,
    min_split_loss = 0,
    min_child_weight = 1
)

# Fit on the SMOTE-resampled training data (no scaling or PCA)
clf.fit(X_res, y_res)
Out[38]:
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric='auc', gamma=0, gpu_id=0, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.3, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
              min_split_loss=0, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=-1, num_parallel_tree=1,
              predictor='auto', random_state=0, reg_alpha=0, ...)
In [96]:
print('XGBoost without PCA, Scale and Center')

print(f'Train AUC score: {round(roc_auc_score(y_res, clf.predict_proba(X_res)[:, 1]),3)}')
print(f'Validation AUC score: {round(roc_auc_score(y_vali, clf.predict_proba(X_vali)[:, 1]),3)}')
print(f'Test AUC score: {round(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]),3)}')
XGBoost without PCA, Scale and Center
Train AUC score: 0.992
Validation AUC score: 0.806
Test AUC score: 0.778

With Scale and PCA

In [40]:
# Supported tree methods are `gpu_hist`, `approx`, and `hist`.
clf_pre = xgb.XGBClassifier(
    tree_method="gpu_hist", 
    objective = 'binary:logistic', #binary:logitraw
    eval_metric = 'auc',
    sampling_method = 'gradient_based',
    n_jobs = -1,
    # Default Parameters
    learning_rate = 0.3,
    max_depth = 6,
    colsample_bytree =1,
    subsample = 1,
    min_split_loss = 0,
    min_child_weight = 1
)

# Fit on the scaled, PCA-transformed training data
clf_pre.fit(train_pca, y_res)
Out[40]:
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric='auc', gamma=0, gpu_id=0, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.3, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
              min_split_loss=0, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=-1, num_parallel_tree=1,
              predictor='auto', random_state=0, reg_alpha=0, ...)
In [41]:
print('XGBoost with PCA, Scale and Center')

print(f'Train AUC score: {round(roc_auc_score(y_res, clf_pre.predict_proba(train_pca)[:, 1]),3)}')
print(f'Validation AUC score: {round(roc_auc_score(y_vali, clf_pre.predict_proba(vali_pca)[:, 1]),3)}')
print(f'Test AUC score: {round(roc_auc_score(y_test, clf_pre.predict_proba(test_pca)[:, 1]),3)}')
XGBoost with PCA, Scale and Center
Train AUC score: 0.989
Validation AUC score: 0.776
Test AUC score: 0.757

The model that uses all variables without pre-processing gave better results, so that is the model we will tune in the next subchapter.

XGBoost Tuning

In [64]:
estimator = xgb.XGBClassifier(
    tree_method="gpu_hist", 
    objective = 'binary:logistic',
    eval_metric = 'auc',
    sampling_method = 'gradient_based',
    n_jobs = -1
)

parameters = {
    'max_depth': range (2, 6, 1),
    'n_estimators': range(50, 500, 50),
    'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.3],
    'colsample_bytree' : [0.8, 0.85, 0.9, 0.95, 1],
    'colsample_bylevel': [0.8, 0.85, 0.9, 0.95, 1],
    'subsample' : [0.5, 0.6, 0.7, 0.8, 0.9, 1],
    'min_split_loss' : [0, 0.5, 2, 5, 10, 15]
}
In [65]:
split_index = [-1]*len(X_res) + [0]*len(X_vali)

X_new2 = np.concatenate((X_res, X_vali), axis=0)
y_new2 = np.concatenate((y_res, y_vali), axis=0)

cv_split = PredefinedSplit(test_fold= split_index)

print(X_new2.shape)
print(y_new2.shape)
(113300, 305)
(113300,)
In [66]:
%%time
xgb_grid = RandomizedSearchCV(
    estimator=estimator,
    param_distributions=parameters,
    n_iter = 1000,
    scoring = 'roc_auc',
    n_jobs = -1,
    cv = cv_split,
    verbose = 3
)

xgb_grid.fit(X_new2, y_new2)
Fitting 1 folds for each of 1000 candidates, totalling 1000 fits
Wall time: 36min 10s
Out[66]:
RandomizedSearchCV(cv=PredefinedSplit(test_fold=array([-1, -1, ...,  0,  0])),
                   estimator=XGBClassifier(base_score=None, booster=None,
                                           callbacks=None,
                                           colsample_bylevel=None,
                                           colsample_bynode=None,
                                           colsample_bytree=None,
                                           early_stopping_rounds=None,
                                           enable_categorical=False,
                                           eval_metric='auc', gamma=None,
                                           gpu_id=None, grow_policy=None,
                                           importance_type=None,
                                           interaction_...
                                           reg_alpha=None, reg_lambda=None, ...),
                   n_iter=1000, n_jobs=-1,
                   param_distributions={'colsample_bylevel': [0.8, 0.85, 0.9,
                                                              0.95, 1],
                                        'colsample_bytree': [0.8, 0.85, 0.9,
                                                             0.95, 1],
                                        'learning_rate': [0.001, 0.01, 0.05,
                                                          0.1, 0.3],
                                        'max_depth': range(2, 6),
                                        'min_split_loss': [0, 0.5, 2, 5, 10,
                                                           15],
                                        'n_estimators': range(50, 500, 50),
                                        'subsample': [0.5, 0.6, 0.7, 0.8, 0.9,
                                                      1]},
                   scoring='roc_auc', verbose=3)
In [68]:
xgb_grid.best_params_
Out[68]:
{'subsample': 0.9,
 'n_estimators': 150,
 'min_split_loss': 10,
 'max_depth': 5,
 'learning_rate': 0.05,
 'colsample_bytree': 1,
 'colsample_bylevel': 1}

The best model has the following parameters:

  • 'subsample': 0.9
  • 'n_estimators': 150
  • 'min_split_loss': 10
  • 'max_depth': 5
  • 'learning_rate': 0.05
  • 'colsample_bytree': 1
  • 'colsample_bylevel': 1
In [69]:
print('XGBoost - Tuned Model')

print(f'Train AUC score: {round(roc_auc_score(y_res, xgb_grid.predict_proba(X_res)[:, 1]),3)}')
print(f'Vali AUC score: {round(roc_auc_score(y_vali, xgb_grid.predict_proba(X_vali)[:, 1]),3)}')
print(f'Test AUC score: {round(roc_auc_score(y_test, xgb_grid.predict_proba(X_test)[:, 1]),3)}')
XGBoost - Tuned Model
Train AUC score: 0.981
Vali AUC score: 0.82
Test AUC score: 0.803
In [87]:
fpr, tpr, roc_auc = calculate_roc_fpr_tpr(X_res, X_test, X_vali, model = xgb_grid)
plot_auc_curve(fpr, tpr, roc_auc, title_model= 'XGBoost Tuned')

Although the model is still overfitting, we were able to significantly improve the validation and test scores.

LightGBM

In [92]:
import lightgbm as lgb
In [100]:
lgbm_class = lgb.LGBMClassifier(boosting_type ='gbdt', 
                                max_depth= -1,
                                num_leaves= 31,
                                learning_rate= 0.1,
                                n_estimators= 100,
                                objective= 'binary',
                                min_child_samples= 20,
                                colsample_bytree= 1.0,
                                subsample= 1.0,
                                n_jobs= -1,
                                importance_type= 'split',
                                reg_alpha= 0.0,
                                reg_lambda= 0.0,
                                silent= 'warn'
                                
)

lgbm_class.fit(X_res, y_res)
Out[100]:
LGBMClassifier(objective='binary')
In [101]:
print('LightGBM')

print(f'Train AUC score: {round(roc_auc_score(y_res, lgbm_class.predict_proba(X_res)[:, 1]),3)}')
print(f'Validation AUC score: {round(roc_auc_score(y_vali, lgbm_class.predict_proba(X_vali)[:, 1]),3)}')
print(f'Test AUC score: {round(roc_auc_score(y_test, lgbm_class.predict_proba(X_test)[:, 1]),3)}')
LightGBM
Train AUC score: 0.988
Validation AUC score: 0.808
Test AUC score: 0.797

LightGBM tuning

In [121]:
lgbm_estimator = lgb.LGBMClassifier(boosting_type ='gbdt', 
                                    objective= 'binary',
                                    n_jobs= -1, 
                                    silent= 'warn',
                                    importance_type= 'split'
                                   )
                                 
                             
parameters = {
    'max_depth': range(2, 6, 1),
    'num_leaves' : range(5, 50, 2),
    'n_estimators': range(50, 500, 50),
    'learning_rate': [0.001, 0.01, 0.03, 0.05, 0.075, 0.1, 0.3],
    'min_child_samples' : np.arange(20, 100, 5),
    'colsample_bytree' : np.arange(0.8, 1, 0.02),
    'subsample' : np.arange(0.5, 1, 0.05),
    'reg_alpha': np.arange(0, 1, 0.1),
    'reg_lambda': np.arange(0, 1, 0.1)    
}
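The grid above defines an enormous joint search space, which is why `RandomizedSearchCV` (sampling 10,000 candidates) is used rather than an exhaustive grid search. The size of the space can be checked directly:

```python
from math import prod
import numpy as np

# Same parameter grid as above
parameters = {
    'max_depth': range(2, 6, 1),
    'num_leaves': range(5, 50, 2),
    'n_estimators': range(50, 500, 50),
    'learning_rate': [0.001, 0.01, 0.03, 0.05, 0.075, 0.1, 0.3],
    'min_child_samples': np.arange(20, 100, 5),
    'colsample_bytree': np.arange(0.8, 1, 0.02),
    'subsample': np.arange(0.5, 1, 0.05),
    'reg_alpha': np.arange(0, 1, 0.1),
    'reg_lambda': np.arange(0, 1, 0.1),
}

# Total number of distinct parameter combinations
n_combinations = prod(len(v) for v in parameters.values())
print(n_combinations)  # hundreds of millions, far beyond the 10,000 random draws
```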
In [122]:
%%time
light_grid = RandomizedSearchCV(
    estimator=lgbm_estimator,
    param_distributions=parameters,
    n_iter = 10000,
    scoring = 'roc_auc',
    n_jobs = -1,
    cv = cv_split,
    verbose = 3
)

light_grid.fit(X_new2, y_new2)
Fitting 1 folds for each of 10000 candidates, totalling 10000 fits
Wall time: 1h 55min 30s
Out[122]:
RandomizedSearchCV(cv=PredefinedSplit(test_fold=array([-1, -1, ...,  0,  0])),
                   estimator=LGBMClassifier(objective='binary'), n_iter=10000,
                   n_jobs=-1,
                   param_distributions={'colsample_bytree': array([0.8 , 0.82, 0.84, 0.86, 0.88, 0.9 , 0.92, 0.94, 0.96, 0.98]),
                                        'learning_rate': [0.001, 0.01, 0.03,
                                                          0.05, 0.075, 0.1,
                                                          0.3],
                                        'max_depth': range(2, 6),
                                        'min_child_samples': array([20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]),
                                        'n_estimators': range(50, 500, 50),
                                        'num_leaves': range(5, 50, 2),
                                        'reg_alpha': array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]),
                                        'reg_lambda': array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]),
                                        'subsample': array([0.5 , 0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95])},
                   scoring='roc_auc', verbose=3)
In [123]:
print(f'Best LightGBM Score: {round(light_grid.best_score_,3 )}')
print(f'Best LightGBM Parameters: {light_grid.best_params_}')
Best LightGBM Score: 0.813
Best LightGBM Parameters: {'subsample': 0.9500000000000004, 'reg_lambda': 0.5, 'reg_alpha': 0.2, 'num_leaves': 39, 'n_estimators': 200, 'min_child_samples': 25, 'max_depth': 5, 'learning_rate': 0.03, 'colsample_bytree': 0.8800000000000001}

Best LightGBM parameters:

  • 'subsample': 0.95
  • 'reg_lambda': 0.5
  • 'reg_alpha': 0.2
  • 'num_leaves': 39
  • 'n_estimators': 200
  • 'min_child_samples': 25
  • 'max_depth': 5
  • 'learning_rate': 0.03
  • 'colsample_bytree': 0.88
In [124]:
print('LightGBM - Tuned')

print(f'Train AUC score: {round(roc_auc_score(y_res, light_grid.predict_proba(X_res)[:, 1]),3)}')
print(f'Validation AUC score: {round(roc_auc_score(y_vali, light_grid.predict_proba(X_vali)[:, 1]),3)}')
print(f'Test AUC score: {round(roc_auc_score(y_test, light_grid.predict_proba(X_test)[:, 1]),3)}')
LightGBM - Tuned
Train AUC score: 0.98
Validation AUC score: 0.82
Test AUC score: 0.806
In [222]:
fpr, tpr, roc_auc = calculate_roc_fpr_tpr(X_res, X_test, X_vali, model = light_grid)
plot_auc_curve(fpr, tpr, roc_auc, title_model= 'LightGBM Tuned')

Predict on test data

In [214]:
test =  pd.read_csv('data/test.csv')
test
Out[214]:
ID var3 var15 imp_ent_var16_ult1 imp_op_var39_comer_ult1 imp_op_var39_comer_ult3 imp_op_var40_comer_ult1 imp_op_var40_comer_ult3 imp_op_var40_efect_ult1 imp_op_var40_efect_ult3 ... saldo_medio_var29_ult3 saldo_medio_var33_hace2 saldo_medio_var33_hace3 saldo_medio_var33_ult1 saldo_medio_var33_ult3 saldo_medio_var44_hace2 saldo_medio_var44_hace3 saldo_medio_var44_ult1 saldo_medio_var44_ult3 var38
0 2 2 32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 40532.100000
1 5 2 35 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 45486.720000
2 6 2 23 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 46993.950000
3 7 2 24 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 187898.610000
4 9 2 23 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 73649.730000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
75813 151831 2 23 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 40243.200000
75814 151832 2 26 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 146961.300000
75815 151833 2 24 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 167299.770000
75816 151834 2 40 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 117310.979016
75817 151837 2 23 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 117310.979016

75818 rows × 370 columns

In [216]:
test_id = test.ID


test_data = test[final_col.drop('TARGET')]
In [217]:
test_data.shape
Out[217]:
(75818, 305)
In [219]:
df_submission = pd.DataFrame(data = {'ID' : test_id,
                                    'TARGET' : light_grid.predict_proba(test_data)[:, 1]})

df_submission
Out[219]:
ID TARGET
0 2 0.134656
1 5 0.128477
2 6 0.008862
3 7 0.055962
4 9 0.008765
... ... ...
75813 151831 0.235757
75814 151832 0.091952
75815 151833 0.021251
75816 151834 0.503607
75817 151837 0.013488

75818 rows × 2 columns

In [220]:
df_submission.to_csv('submission_v1.csv', index = False)
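Before uploading, it can be worth sanity-checking the submission format Kaggle expects: exactly the columns `ID` and `TARGET`, unique IDs, and probabilities in [0, 1]. A small sketch; the three-row frame and the filename `submission_check.csv` are toy stand-ins, not the real 75,818-row submission:

```python
import pandas as pd

# Toy stand-in for the real submission frame
df_check = pd.DataFrame({'ID': [2, 5, 6],
                         'TARGET': [0.134656, 0.128477, 0.008862]})
df_check.to_csv('submission_check.csv', index=False)

# Re-read and verify the expected Kaggle format
check = pd.read_csv('submission_check.csv')
assert list(check.columns) == ['ID', 'TARGET']
assert check['ID'].is_unique
assert check['TARGET'].between(0, 1).all()
print(len(check))  # 3
```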

The final predictions were submitted to Kaggle, and we obtained the following scores:

  • Private Score: 0.79478
  • Public score: 0.81652